\chapter{Introduction}

To navigate in an unknown environment, a mobile robot needs to build a map of the environment and localize itself within that map at the same time. The process addressing this dual problem is called \gls{SLAM}. Outdoors, it can generally be solved with a GPS, which provides sufficient accuracy for the tasks the robot can take on. However, indoors or in places where GPS data is unavailable or not reliable enough, it becomes difficult to estimate the robot's position precisely, and other solutions have to be found.

The main difficulty raised by \gls{SLAM} comes from the uncertainty of the measurements, due to sensor noise and technical limitations. Probabilistic models are widely used to reduce the inherent errors and provide satisfying estimations. While this process is generally based on data provided by sensors such as laser scanners, combined with the odometry, \gls{VSLAM} focuses on the use of cameras, as illustrated in figure~\ref{fig:vslam_overview}.

\begin{figure}[H]
\centering
\includegraphics[width=1.0\textwidth]{figures/visual_slam}
\caption{Concept of Visual SLAM. The poses of the camera (hence, the robot for a fixed camera) are determined from video data. The estimations generally drift with respect to the real trajectory, and the uncertainty grows over time.}
\label{fig:vslam_overview}
\end{figure}

\clearpage
\section{Context}

Currently, most robotic mapping is performed using sensors that offer only a 2D cross section of the environment around them. One reason is that acquiring high-quality 3D data was either very expensive or imposed hard constraints on the robot's movements. Research has therefore mainly focused on laser scanners to solve the \gls{SLAM} problem, although some methods make use of stereo and monocular cameras~\cite{KonoligeA08}~\cite{StrasdatMD10}. Recently, however, there has been great interest in processing data acquired with depth-measuring sensors, due to the availability of cheap and efficient RGB-D cameras. For instance, the Kinect camera, developed by PrimeSense and Microsoft, has considerably changed the situation by providing a 3D camera at a very affordable price. Primarily designed for entertainment, it has received a warm welcome in the research community, especially in robotics.

In the past, to solve the inherent problem of drift and provide a reliable estimation of the camera poses, most projects have used techniques such as the \gls{EKF} or particle filters~\cite{Thrun_2005}. More recently, several methods rely on pose graphs to model and improve the estimations~\cite{Thrun05_GraphSLAM}~\cite{grisetti07rss}~\cite{g2o_2011}~\cite{hogman_2010}. Some projects making use of both RGB-D data and graph optimization~\cite{Henry_RGBD_2010}~\cite{engelhard11euron-workshop} illustrate the interest in such an approach.

The work presented in this thesis was done at the \gls{CVAP} at KTH, the Royal Institute of Technology, in Stockholm. Since 1982, research at \gls{CVAP} has focused on the topics of computer vision and robotics.

\clearpage
\section{Goals}

The main goal of this thesis is to build a 3D map from RGB and depth information provided by a camera, considering a 6~\gls{DOF} motion system. The hardware device used for the experiments is the Microsoft Kinect, but this work could be extended to any system providing video and depth data.

The \gls{VSLAM} process can be described as estimating the poses of the camera from its data stream (video and depth), in order to reconstruct the entire environment while the camera is moving. As sensor noise leads to deviations of each estimated camera pose with respect to the real motion, the goal is to build a 3D map that is as close as possible to the real environment.

One objective is to get an overview of the different methods and techniques, so that the problems can be better identified and then analyzed more deeply through experiments. This work will focus on the use of visual features. Though a \gls{VSLAM} system is intended to be used in real time, some of the processing may be done without performance as the main priority, and could therefore be delayed. The rendering of the scene is also out of the scope of this study.

\begin{figure}[H]
\centering
\includegraphics[width=1.0\textwidth]{pictures/kinect_mount}
\caption{Microsoft Kinect mounted on a mobile robot platform at CVAP (KTH)}
\end{figure}

\clearpage
\section{Thesis outline}

The rest of the document is structured as follows:
\begin{description}
\item[Chapter \ref{chap:background}] presents the background and the underlying concepts of this area: the features and methods commonly used in computer vision, and general notions about SLAM.
\item[Chapter \ref{chap:features}] presents the feature matching between a pair of frames, and how a 3D transformation can be computed from these associations using the depth data.
\item[Chapter \ref{chap:map}] describes how a map can be built, by estimating the camera poses through the use of a pose graph, refining them through the detection of loop closures, and finally performing a 3D reconstruction.
\item[Chapter \ref{chap:experiments}] presents the experiments, the software, how the data was acquired, and finally the results, with examples of maps generated from different datasets.
\item[Chapter \ref{chap:conclusion}] presents the conclusions and future work, with some suggestions for possible improvements.
\end{description}

\chapter{Background}
\label{chap:background}

In this chapter, we first present the Kinect camera, then describe some of the features and methods commonly used in computer vision, and finally give a short introduction to SLAM concepts.

\section{Microsoft Kinect}

As mentioned in the introduction, the hardware used in this work is the Kinect, a device developed by PrimeSense, initially for the Microsoft Xbox 360, and released in November 2010. It is composed of an RGB camera, a 3D depth sensing system, a multi-array microphone and a motorized tilt. In this work, only the RGB and depth sensors are used to provide the input data.

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{pictures/kinect_sensor}
\caption{The Kinect sensor device (courtesy of Microsoft)}
\end{figure}

Main characteristics of the Kinect:
\begin{itemize}
 \item The RGB sensor is a regular camera that streams video with 8 bits per color channel, giving a 24-bit color depth. Its color filter array is a Bayer filter mosaic. The color resolution is 640x480 pixels with a maximal frame rate of 30~Hz.
 \item The depth sensing system is composed of an IR emitter projecting structured light, which is captured by the CMOS image sensor and decoded to produce the depth image of the scene. Its range is specified to be between 0.7 and 6~meters, although the best results are obtained from 1.2 to 3.5~meters. Its output data has a 12-bit depth. The depth sensor resolution is 320x240 pixels at a rate of 30~Hz.
 \item The field of view is 57\textdegree ~horizontal, 43\textdegree ~vertical, with a tilt range of $\pm$ 27\textdegree.
\end{itemize}

The drivers used are those developed by OpenNI\footnote{\url{http://www.openni.org/}} (Open Natural Interaction), an organization established in November 2010, launched by PrimeSense and joined by Willow Garage, among others. The OpenNI framework offers high-level functions mainly oriented toward gaming, such as gesture recognition and motion tracking. It aims to provide an abstraction layer with a generic interface to the hardware devices. In the case of the Kinect, one advantage is that the calibration of the RGB sensor with respect to the IR sensor is ensured, so the resulting RGB and depth data are correctly mapped to a unique viewpoint.


\section{Features}

In this work, we focus on tracking parts of the scene observed by the camera. In computer vision, and more specifically in object recognition, many techniques are based on the detection of points of interest on objects or surfaces. This is done through the extraction of \emph{features}. In order to track these points of interest during a motion of the camera and/or the robot, a reliable feature has to be invariant to image location, scale and rotation. A few methods are briefly presented here:

\begin{description}
\item[Harris Corner] A corner detector, by Harris and Stephens~\cite{Harris88alvey}
\item[\acrshort{SIFT}]\acrlong{SIFT}, by David Lowe~\cite{lowe_2004_sift} 
\item[\acrshort{SURF}]\acrlong{SURF}~\cite{surf}
\item[\acrshort{NARF}]\acrlong{NARF}~\cite{steder10irosws}
\item[\acrshort{BRIEF}]\acrlong{BRIEF}~\cite{Calonder10-brief}
\end{description}

There are two aspects to a feature: the \emph{detection} of a keypoint, which identifies an area of interest, and its \emph{descriptor}, which characterizes its region. Typically, the detector identifies a region containing a strong variation of intensity, such as an edge or a corner, and its center is designated as a keypoint. The descriptor is generally computed by measuring the main orientations of the surrounding points, leading to a multidimensional \emph{feature vector} which identifies the given keypoint. Given a set of features, a matching can then be performed in order to associate pairs of keypoints between two frames.
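To make the matching step concrete, here is a minimal sketch in Python (using NumPy) of nearest-neighbor descriptor matching with a ratio test; the 4-dimensional descriptors are toy values chosen for illustration, not the output of a real detector:

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.8):
    """Associate keypoints between two frames by comparing descriptors.

    desc_a, desc_b: (N, D) and (M, D) arrays of feature vectors.
    Returns (index_in_a, index_in_b) pairs whose nearest neighbor is
    clearly better than the second nearest (ratio test).
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from this descriptor to all descriptors in b
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, nearest))
    return matches

# Toy example: two descriptors in frame a, three in frame b
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.9, 0.1, 0.0],    # close to a[1]
              [5.0, 5.0, 5.0, 5.0],    # unrelated
              [1.1, 0.0, 0.0, 0.0]])   # close to a[0]
print(match_features(a, b))  # [(0, 2), (1, 0)]
```

Real descriptors are of course higher dimensional (128 for \gls{SIFT}), and a brute-force loop like this one is replaced by an indexing structure such as a kd-tree when the number of features grows.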

\clearpage
The features listed previously are summarized in table~\ref{tab:features}.

\begin{table}[H]
 \centering
  \begin{tabular}{c|cc}
  \hline
  Feature & Detector & Descriptor \\
  \hline
  Harris Corner & Yes & No \\
  SIFT & Yes & Yes \\
  SURF & Yes & Yes \\
  NARF & Yes & Yes \\
  BRIEF & No & Yes \\
  \hline
  \end{tabular}
 \caption {Overview of some existing features (non-exhaustive list)}
 \label{tab:features}
\end{table}

\subsection{Harris Corner}

Known as the Harris corner operator, this is one of the earliest detectors, proposed in 1988 by Harris and Stephens~\cite{Harris88alvey}. The notion of corner should be taken in a wide sense, as the operator detects not only corners but also edges and, more generally, keypoints. This is done by computing the second moment matrix (or auto-correlation matrix) of the image intensities, describing their local variations. One of the main limitations of the Harris operator, at least in its original version, concerns scale invariance, as the matrix has to be recomputed for each scale. Therefore, we will not give further details about this method.

\subsection{SIFT feature}

\glsreset{SIFT}
The \gls{SIFT} is a method presented by David Lowe~\cite{lowe_2004_sift}, now widely used in robotics and computer vision.
It detects distinctive, invariant image feature points, which can easily be matched between images to perform tasks such as object detection and recognition, or to compute geometrical transformations between images.

The main idea of the \gls{SIFT} method is to define a cascade of operations of increasing complexity, so that the most expensive operations are only applied to the most probable candidates.
\begin{enumerate}
\item The first step relies on a pyramid of \gls{DoG} in order to be invariant to scale and orientation.
\item From this, stable keypoints are determined with a more accurate model.
\item The image gradient directions are then used to assign one or more orientations to each keypoint.
\item The local image gradients are then transformed to be stable against distortion and changes in illumination.
\end{enumerate}

\clearpage
\paragraph{\emph{Detector}}

One characteristic of the SIFT detector is scale invariance, which ensures that a keypoint can be detected in a stable way at different scales; this is a fundamental requirement for a mobile robot. As the platform carrying the camera moves, the areas containing the points of interest appear larger or smaller, relative to a scale factor. Most of these concepts concerning the study of the scale space are based on the works of Lindeberg~\cite{Lindeberg_1994}. The general idea is shown in figure~\ref{fig:sift_dog}. First, the scale space is defined as the convolution of an image~$I$ with a Gaussian~$G$ of standard deviation~$\sigma$.

\[ L(x,y,\sigma) = G(x,y,\sigma) * I(x,y) \]

where

\[ G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}e^{-{(x^2+y^2)/2\sigma^2}} \]


The \gls{DoG} is the difference between two layers in scale space along the $\sigma$ axis:

\[
\begin{array}{rcl}
D(x,y,\sigma) & = & (G(x,y, k\sigma) - G(x,y,\sigma)) * I(x,y) \\
 & = & L(x,y,k\sigma) - L(x,y,\sigma)
\end{array}
\] 

This provides a close approximation to the scale-normalized Laplacian of Gaussian $\sigma^2\nabla^2G$, as shown by Lindeberg~\cite{Lindeberg_1994}:

\[ \sigma\nabla^2G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x,y,k\sigma)-G(x,y,\sigma)}{k\sigma - \sigma} \]

and therefore:

\[ G(x,y,k\sigma) - G(x,y,\sigma) \approx (k-1)\sigma^2\nabla^2G \]

The $\sigma^2$ factor provides the scale invariance. The localization of the keypoints is then done by finding the extrema of the \gls{DoG}, which approximates the Laplacian~$\sigma^2\nabla^2G$. Lowe shows that the remaining factor $(k-1)$ does not influence the location.
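The scale-space construction described above can be sketched in a few lines of Python (NumPy only). This is only an illustration of the \gls{DoG} pyramid for one octave; a real implementation such as Lowe's adds downsampling between octaves and subpixel extrema localization:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled 1D Gaussian, truncated at 3 sigma and normalized."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable convolution: filter the rows, then the columns
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, rows)

def dog_pyramid(img, sigma0=1.6, k=2**0.5, levels=4):
    """One octave: successive Gaussian blurs and their differences."""
    scales = [gaussian_blur(img, sigma0 * k**i) for i in range(levels)]
    return [scales[i + 1] - scales[i] for i in range(levels - 1)]

img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0       # a bright square as a toy image
dogs = dog_pyramid(img)
print(len(dogs))              # 3 DoG layers from 4 Gaussian levels
```

Keypoint candidates are then the local extrema of these DoG layers, compared both spatially and across neighboring scales.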

\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{figures/sift_dog}
\caption{Difference of Gaussians at different scales (courtesy of David Lowe). For each octave of the scale space, the image is convolved with Gaussians whose standard deviation~$\sigma$ is multiplied each time by a constant factor. Two consecutive results are subtracted to give the DoG, which approximates the Laplacian of Gaussian. After each octave, the image is downsampled by two and the process is repeated with a doubled~$\sigma$. The initial value of~$\sigma$ can be chosen according to the type of application.}
\label{fig:sift_dog}
\end{figure}

\paragraph{\emph{Descriptor}}

In order to provide a good basis for the matching, the descriptor has to be highly distinctive and invariant to the remaining variations, such as changes in illumination or 3D viewpoint. The \gls{SIFT} descriptor is based on a \gls{HoG} around the keypoint. It is stored as a 128-dimensional feature vector (a 4x4 descriptor grid with 8 orientation bins). Figure~\ref{fig:sift_descriptor} illustrates this concept with a smaller 2x2 descriptor.

\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{figures/sift_descriptor}
\caption{SIFT descriptor (courtesy of David Lowe). Illustration of how the gradient directions are accumulated into orientation histograms around the center of the region defined by the keypoint. The gradients are first weighted by a Gaussian window represented by the circle. From the 8x8 sample, a 2x2 descriptor is computed this way, where each bin covers a 4x4 subregion.}
\label{fig:sift_descriptor}
\end{figure}
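The accumulation of gradients into orientation histograms can be illustrated with the following simplified Python/NumPy sketch. It uses the figure's 2x2 grid instead of the real 4x4 one, and omits the Gaussian weighting and the rotation of the patch to the keypoint's dominant orientation:

```python
import numpy as np

def orientation_histograms(patch, grid=2, bins=8):
    """Accumulate gradient orientations into a grid of histograms,
    in the spirit of the SIFT descriptor (which uses a 4x4 grid)."""
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)                 # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)  # gradient direction in [0, 2pi)
    n = patch.shape[0] // grid
    desc = []
    for r in range(grid):
        for c in range(grid):
            m = mag[r*n:(r+1)*n, c*n:(c+1)*n].ravel()
            a = ang[r*n:(r+1)*n, c*n:(c+1)*n].ravel()
            # histogram of directions, weighted by gradient magnitude
            hist, _ = np.histogram(a, bins=bins, range=(0, 2*np.pi), weights=m)
            desc.append(hist)
    v = np.concatenate(desc)
    return v / (np.linalg.norm(v) + 1e-10)  # normalize against illumination

patch = np.outer(np.arange(8.0), np.ones(8))  # vertical intensity ramp
d = orientation_histograms(patch)
print(d.shape)  # (32,) = 2x2 grid x 8 orientation bins
```

With the real parameters (4x4 grid, 8 bins) the same construction yields the 128-dimensional \gls{SIFT} vector.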

\clearpage
\paragraph{\emph{Matching}}

Once the keypoints are found and the descriptors computed, the next step is to perform fast searches on the features in order to identify candidate matches, associating pairs of features between the two sources. Refer to the method described in~\cite{lowe_2004_sift} for further details.

\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{pictures/sift_matches_robhess}
\caption{SIFT features matched between the two images (courtesy of Rob Hess)}
\end{figure}

For this work, we use the \gls{SIFT} library developed by Rob Hess~\cite{hess_sift}. In addition to the feature extraction, this library provides a useful function to perform the initial matching through a kd-tree. There is also a set of template functions that can be used to compute geometrical transformations with \gls{RANSAC} (a method described in~\ref{sub:ransac}), but the given interface is written for 2D operations and therefore will not be used.
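The kd-tree matching combined with Lowe's ratio test can be sketched as follows (in Python, using SciPy's \texttt{cKDTree} rather than Hess's C library; the descriptors are random toy data standing in for real 128-dimensional \gls{SIFT} vectors):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy descriptor sets (stand-ins for 128-dimensional SIFT vectors)
rng = np.random.default_rng(0)
desc_prev = rng.random((100, 128))          # features of the previous frame
desc_curr = desc_prev[:20] + rng.normal(0, 0.01, (20, 128))  # 20 noisy re-detections

tree = cKDTree(desc_prev)                   # index the previous descriptors
dists, idx = tree.query(desc_curr, k=2)     # two nearest neighbors per query

# Ratio test: keep a match only if it is clearly unambiguous
good = dists[:, 0] < 0.6 * dists[:, 1]
matches = list(zip(np.flatnonzero(good), idx[good, 0]))
print(len(matches))  # all 20 noisy copies are matched back
```

The kd-tree makes each query logarithmic in the number of stored descriptors instead of linear, which matters once thousands of features per frame are involved.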

\subsection{SURF feature}

\glsreset{SURF}
The \gls{SURF} provides a robust detector and descriptor~\cite{surf} that can be used in computer vision tasks like object recognition or 3D reconstruction. It is partly inspired by the \gls{SIFT} descriptor: both use local gradient histograms. The main difference concerns performance: the computational time is lowered through an efficient use of integral images for the image convolutions, a Hessian matrix-based detector (optimized through approximations of the second-order Gaussian partial derivatives, see figure~\ref{fig:surf_detector}), and sums of approximated 2D Haar wavelet responses for the descriptor (see figure~\ref{fig:surf_descriptor}). The standard version of \gls{SURF} is several times faster than \gls{SIFT} and is claimed by its authors to be more robust against different image transformations.

\begin{figure}[H]
\centering
 \begin{tabular}{cc}
 \includegraphics[width=0.4\textwidth]{figures/surf_lyy} &
 \includegraphics[width=0.4\textwidth]{figures/surf_lxy}
 \end{tabular}
\caption{SURF detector relies on approximation of the Gaussian second order derivatives.}
\label{fig:surf_detector}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{figures/surf_scale}
\caption{SURF scale space is built by changing the filter size.}
\label{fig:surf_scale}
\end{figure}

\begin{figure}[H]
\centering
 \begin{tabular}{cc}
 \includegraphics[width=0.4\textwidth]{figures/surf_orientation} &
 \includegraphics[width=0.4\textwidth]{figures/surf_descriptor}
\end{tabular}
\caption{SURF keypoint orientation is found by summing up the response of Haar wavelets with a sliding window of 60 degrees. The descriptor is finally computed by splitting the region around the keypoint into 4x4 subregions and computing Haar wavelet responses weighted with a Gaussian kernel.}
\label{fig:surf_descriptor}
\end{figure}

\clearpage
\subsection{NARF feature}

\glsreset{NARF}
The \gls{NARF} is presented by Bastian Steder in~\cite{steder10irosws}. It is meant to be used on single range scans obtained with 3D laser range finders or stereo cameras. This feature is available in the \gls{PCL}~\cite{Rusu_ICRA2011_PCL}, which is part of \gls{ROS} but also released as a standalone library. Its detector looks for stable areas with a significant change in the vicinity, which can be identified from different viewpoints. The descriptor characterizes the area around the keypoint by calculating a normal-aligned range value patch and finding the dominant orientation of the neighboring pixels.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{figures/narf_descriptor_visualization}
\caption{NARF feature (courtesy of Bastian Steder). From the range image, a descriptor is computed at the given keypoint, localized by the position of the green cross. The right figure shows the surface patch containing the corner of the armchair and the dominant orientation, indicated by the red arrow. Each component of the descriptor is associated with a direction given by one of the green beams. The bottom left picture shows the values of this descriptor. The middle left figure shows the distance to this descriptor for every point in the scene, and therefore their similarity with the current keypoint. The darkest areas are the closest, identifying the areas most similar to the left corner of the armchair.}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=0.4\textwidth]{pictures/narf1}
\caption{NARF keypoints computed on a desk}
\end{figure}

\subsection{BRIEF feature}

\glsreset{BRIEF}
The \gls{BRIEF} descriptor, presented in~\cite{Calonder10-brief}, is an efficient alternative based on binary strings computed directly from image patches; matching then uses the Hamming distance instead of the $L_2$ norm commonly used for high-dimensional descriptors. As the binary comparison can be performed very efficiently, the matching between several candidates can be done much faster. The use of a \gls{BRIEF} descriptor supposes that the keypoints are already known; they can be obtained with a detector such as \gls{SIFT} or \gls{SURF}. A deeper study is available in the PhD thesis of Calonder~\cite{Calonder10_PhD}, the main author of \gls{BRIEF}, which presents a comparative evaluation of \gls{BRIEF}, \gls{SIFT} and \gls{SURF}. The main interest of this descriptor resides in its performance.
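The principle can be sketched as follows in Python/NumPy. The pixel pairs are drawn uniformly here, whereas \gls{BRIEF} samples them from a Gaussian distribution around the keypoint, and a real implementation packs the bits to compare them with a few machine instructions:

```python
import numpy as np

def brief_like_descriptor(patch, pairs):
    """Binary string: one intensity comparison per predefined pixel pair."""
    return np.array([patch[p1] < patch[p2] for p1, p2 in pairs], dtype=np.uint8)

def hamming(d1, d2):
    """Number of differing bits between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))

rng = np.random.default_rng(1)
# 32 random test pairs inside an 8x8 patch (kept fixed for all patches)
pairs = [((rng.integers(8), rng.integers(8)),
          (rng.integers(8), rng.integers(8))) for _ in range(32)]

patch = rng.random((8, 8))                     # patch around a keypoint
noisy = patch + rng.normal(0, 0.005, (8, 8))   # same patch, slight noise
other = rng.random((8, 8))                     # unrelated patch

d0, d1, d2 = (brief_like_descriptor(p, pairs) for p in (patch, noisy, other))
print(hamming(d0, d1), hamming(d0, d2))  # typically few differing bits vs. many
```

The intensity comparisons are largely preserved under noise, so the true match keeps a small Hamming distance while unrelated patches disagree on a large fraction of the bits.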

\section{Computing a transformation}

Once the features have been computed on the whole image, the goal is to track them during the movement of the camera, by associating them between different frames. Here, we only consider the matching for a pair of frames; the process can then be repeated on the whole sequence. Associating pairs of features is not straightforward, as the descriptors are not exactly the same between two different frames, first as a consequence of the movement of the camera and also because of sensor noise. The best matching then consists in finding the correct associations with high confidence. Most algorithms proceed in several steps, starting from a sparse level with a wide hypothesis, to eliminate the most obvious mismatches at a low computational cost, before refining. The matches that fit the model are called \emph{inliers}, while the discarded matches are called \emph{outliers}.

\subsection{RANSAC}
\label{sub:ransac}

\glsreset{RANSAC}
The \gls{RANSAC} is an iterative method, widely known in computer vision, to estimate the parameters of a transformation from a dataset. It was first published in 1981 by Fischler and Bolles~\cite{FischlerB81}. With a least-squares approach, the parameters that best satisfy the whole dataset can be found, but this is likely not the optimal solution when some points are noisy. Hence, the idea is to find the parameters which are valid for \emph{most} of the points, a consensus, by discarding the noisy points. At each iteration, a very small number of samples are \emph{randomly} selected to define a model, representing one hypothesis. This model is then evaluated on the whole dataset with an error function. This is repeated several times, choosing new samples at each iteration and keeping the best transformation found. After a fixed number of iterations, the algorithm returns the best transformation found (the one with the lowest error), but not necessarily the optimal one, as not all possibilities have been tested. The general algorithm is presented at page~\pageref{alg:generic_ransac}.

\begin{algorithm}
\caption{General RANSAC}\label{alg:generic_ransac}
\begin{algorithmic}
\REQUIRE Dataset of points
\STATE $bestModel,\:bestInliers \gets \emptyset$
\STATE Define the number of iterations $N$
\FOR{$iteration = 1$ to $N$}
 \STATE $samples \gets$ Pick $k$ points randomly
 \STATE Compute $currentModel$ from $samples$ (base hypothesis)
 \STATE $inliers \gets \emptyset$
 \FORALL{points}
  \STATE Evaluate $currentModel$ for the point and compute its error
  \IF{$error < threshold$}
   \STATE $inliers \gets inliers + point$
  \ENDIF
 \ENDFOR
 \STATE Count number of inliers and compute mean error
 \IF{$currentModel$ is valid}
  \STATE Recompute $currentModel$ from $inliers$
  \IF{$currentModel$ better than $bestModel$}
   \STATE $bestModel \gets currentModel$
   \STATE $bestInliers \gets inliers$
  \ENDIF
 \ENDIF
\ENDFOR
\RETURN $bestModel$, $bestInliers$
\end{algorithmic}
\end{algorithm}

\begin{figure}
\centering$
\begin{array}{c|c|c}
\subfloat[]{\label{fig:exransac1} \includegraphics[width=0.3\textwidth]{figures/ransac1}} &
\subfloat[]{\label{fig:exransac2} \includegraphics[width=0.3\textwidth]{figures/ransac2}} & 
\subfloat[]{\label{fig:exransac3} \includegraphics[width=0.3\textwidth]{figures/ransac3}} \\
\end{array}$
\caption{Illustration of a RANSAC iteration, for a 2D dataset. \protect\subref{fig:exransac1}~First, $k$ items are chosen randomly among the set (here $k=2$). \protect\subref{fig:exransac2}~From these points, a model is defined. Here, a line is drawn between the two chosen points, with an area of validity defined by a given threshold. \protect\subref{fig:exransac3}~The model is then evaluated by measuring the error for each point, here the distance to the line. This separates the dataset into two subsets: the inliers, in green (fitting the model), and the outliers (ignored). The ratio of inliers relative to the total number of items gives a numerical evaluation of the model (``goodness''). The initial model is generally recomputed taking all the inliers into account.}
\end{figure}
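The 2D line-fitting scenario of the figure can be reproduced with a short Python/NumPy sketch (the number of iterations, threshold and noise levels are arbitrary choices for the illustration):

```python
import numpy as np

def ransac_line(points, iters=200, threshold=0.2, seed=42):
    """Fit a 2D line to points contaminated by outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)  # k = 2 samples
        p, q = points[i], points[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm == 0.0:
            continue
        # perpendicular distance of every point to the line through p and q
        dist = np.abs(d[0] * (points[:, 1] - p[1])
                      - d[1] * (points[:, 0] - p[0])) / norm
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit the model on all inliers (least squares: slope a, intercept b)
    a, b = np.polyfit(*points[best_inliers].T, 1)
    return a, b, best_inliers

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
line_pts = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.05, 50)])
outliers = rng.uniform([0, 0], [10, 30], size=(10, 2))  # gross mismatches
points = np.vstack([line_pts, outliers])

a, b, inliers = ransac_line(points)
print(round(a, 1), round(b, 1))  # recovers roughly y = 2x + 1
```

A plain least-squares fit over all 60 points would be pulled away from the line by the 10 outliers; the consensus step discards them before the final fit.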

The main difficulty with the \gls{RANSAC} algorithm is to define the number of iterations and the thresholds.
For the number of iterations, it is possible to derive the ideal value from a desired probability of success. Considering samples of $k$ randomly chosen points, if $p$ denotes the probability that at least one sample set is free of outliers, and $u$ the probability of observing an inlier, we have, for $N$ iterations:

\[
1-p = (1-u^k)^N
\]

It can be more convenient to define $v=1-u$ as the probability of observing an outlier. From this, we can obtain the required number of iterations:

\[
N = \frac{\log(1-p)}{\log(1-(1-v)^k)}
\]

The threshold depends on the problem and the application; it is generally set empirically. Many variants of \gls{RANSAC} try to overcome this problem, such as the MLESAC method~\cite{TorrZ00}, which uses M-estimators in order to find good parameters in a robust way.
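As a numerical example, the formula above can be evaluated for typical values (here a success probability $p=0.99$, an outlier ratio $v=0.5$ and samples of $k=3$ points):

```python
import math

def ransac_iterations(p=0.99, v=0.5, k=3):
    """Iterations N needed so that, with probability p, at least one
    sample of k points is free of outliers (outlier ratio v)."""
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - v) ** k))

print(ransac_iterations())  # 35 iterations for p=0.99, v=0.5, k=3
```

Note how quickly $N$ grows with the sample size $k$ and the outlier ratio $v$, which is why \gls{RANSAC} models are built from as few points as possible.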


\subsection{ICP}

\glsreset{ICP}
The \gls{ICP} is an algorithm presented by Zhang~\cite{zhang_92_icp}.
It iteratively revises the rigid transformation needed to minimize the distance between two sets of points, which can be generated from two successive raw scans. Considering two sets of points $(p_{i})$ and $(q_{i})$, the goal is to find the optimal transformation, composed of a translation $t$ and a rotation $R$, aligning the source set $(p_{i})$ to the target $(q_{i})$. The problem can be formulated as minimizing the squared distance between neighboring pairs:

\[\min_{R,\,t} \sum_{i}{\left\|(Rp_{i}+t)-q_{i}\right\|}^2\]

As with any gradient descent method, \gls{ICP} is applicable when a relatively good initial guess is available in advance. Otherwise, it is likely to be trapped in a local minimum. One possibility, as done by Henry~\cite{Henry_RGBD_2010}, is to run \gls{RANSAC} first to determine a good approximation of the rigid transformation, and use it as the first guess for the \gls{ICP} procedure.
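For a fixed association of the pairs, the minimization above has a closed-form solution, which is the inner step of each \gls{ICP} iteration; one common way to compute it is the SVD-based (Kabsch) method, sketched here in Python/NumPy:

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Closed-form least-squares R, t aligning point set P onto Q,
    via SVD of the cross-covariance (Kabsch method)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # avoid reflections
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# Verify: rotate and translate a toy cloud, then recover the motion
rng = np.random.default_rng(0)
P = rng.random((30, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
Q = P @ R_true.T + t_true                     # q_i = R p_i + t

R, t = best_rigid_transform(P, Q)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```

The full \gls{ICP} loop alternates this closed-form alignment with a re-association of each source point to its nearest target point, until convergence.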

\clearpage
\section{General concepts for SLAM}

As described in the introduction, the \gls{SLAM} problem can be defined as looking for the best estimation of the robot/camera poses, by reducing the uncertainty due to the noise affecting the measurements. With the probabilistic approach, one way is to use methods such as Expectation Maximization, described for example in~\cite{Thrun_2005}. The underlying idea is to compute a most likely map; every time the estimation of the robot pose becomes more accurate, the previously computed map is updated and the process is repeated. The momentary estimation of the position, or \emph{belief}, is represented by a probability density function.

By denoting $x_t$ the state of the robot at time~$t$ (representing the robot's motion variables such as the poses), $z_t$ a measurement (for example camera images or laser range scans), and $u_t$ the control data (denoting the changes of state), the belief over the state variable $x_t$ is given by:

\[bel(x_t) = p(x_t | z_{1:t}, u_{1:t})\]

This posterior is the probability distribution over the state $x_t$ at time $t$, conditioned on all past measurements $z_{1:t}$ and all past controls $u_{1:t}$.

In the case of \gls{VSLAM} with the Kinect, we consider the video sequence as a collection of frames, defined by a flow of RGB and depth data. Here, each measurement $z_t$ is a pair of RGB and depth frames at time~$t$. The total number of frames then depends on the duration of the sequence and the effective frame rate, which is bounded by the Kinect's maximum rate of 30~Hz.


\subsection{Filtering vs Smoothing}

To solve the \gls{SLAM} problem, the literature presents different approaches that can be classified as either filtering or smoothing. Filtering approaches model the problem as an on-line state estimation, where the state of the system consists of the current robot position and the map. The estimate is augmented and refined by incorporating the new measurements as they become available.
The most common techniques are the \glsreset{EKF}\gls{EKF} and particle filters~\cite{Thrun_2005}. To highlight their incremental nature, filtering approaches are usually referred to as on-line \gls{SLAM} methods. The drawback of particle filters is their computational cost, as they maintain several hypotheses of the state variables, whose number can grow fast as the robot moves, while the \gls{EKF} only keeps one hypothesis.

Conversely, smoothing approaches estimate the full trajectory of the robot from the full set of measurements. These approaches address the so-called full \gls{SLAM} problem, and they typically rely on least-squares error minimization techniques. Their performance may have been an issue in the past, but they are now an interesting alternative. In this work, we will therefore study the results that can be obtained with a graph-based approach. In contrast to the on-line techniques, the pose graph is said to be an off-line or ``lazy'' technique, as the optimization is triggered by specific constraints.

\subsection{Pose Graph}

The \gls{SLAM} problem can be defined as a least-squares optimization of an error function, and it can be described by a graph model, combining graph theory and probability theory. Such a graph is called a pose graph, where the nodes (or vertices) represent state variables describing the poses of the camera, and the edges represent the constraints between the observations connecting a pair of nodes. Not only does the graph representation give a quite intuitive and compact representation of the problem, but the graph model is also the basis for the mathematical optimization, where the purpose is to find the most likely configuration of the nodes. Here, we present different frameworks:

\begin{description}
\item[GraphSLAM] In probabilistic robotics~\cite{Thrun_2005}, beliefs are represented through conditional probability distributions. A belief distribution assigns a probability (or density value) to each possible hypothesis with regard to the true state. Belief distributions are posterior probabilities over state variables conditioned on the available data. The idea behind GraphSLAM~\cite{Thrun05_GraphSLAM} is to represent the constraints, defined by the likelihoods of the measurements and motion models, in a graph. The computation of the posterior map is achieved by solving the optimization problem, taking all the constraints into account.
\item[TORO] Tree-based netwORk Optimizer~\cite{grisetti07rss}. A least-squares error minimization technique based on a network of relations is applied to find maximum likelihood (ML) maps. This is done using a variant of gradient descent, taking steps proportional to the negative of the gradient (or an approximation of it). It is used in the 3D dense modeling project led by Henry~\cite{Henry_RGBD_2010}.
\item[HOG-Man] Hierarchical Optimizations on Manifolds for Online 2D and 3D mapping~\cite{hogman_2010}: this approach considers different levels. Solving the problem on lower levels affects only partially the upper levels, and therefore it seems to be well adapted to handle larger and more complex maps. It has been used in the RGBD-6D-SLAM~project~\cite{engelhard11euron-workshop}.
\item[$g^2o$] General Framework for Graph Optimization~\cite{g2o_2011}, by the same authors as HOG-Man~\cite{hogman_2010}. $g^2o$ is a framework which makes it possible to define precisely the error function to be minimized and how the solver should proceed. Classes are provided for the most common problems such as 2D/3D \gls{SLAM} or Bundle Adjustment. The HOG-Man concepts are likely to be integrated in $g^2o$. This is the method chosen to solve the graph problem in this work.
\end{description}

\clearpage
\subsection{Loop closure}

In order to minimize the error, the graph optimization relies on constraints between the nodes. One interesting event occurs when the robot moves back to a position already visited in the past. This is referred to as \emph{loop closure} detection. To illustrate it, let us take an example. Imagine you are visiting a new city and have just arrived at your hotel. Without taking a map, you would like to explore the neighborhood, while still keeping track of the path you follow with a sparse description of the environment (mentally, or by drawing it on paper). After walking for a while around a few blocks, you no longer know exactly where you are. However, by following a circular path, or a square path around standard blocks, you will probably pass again by places you have already visited. By recognizing these places, you are able to estimate more precisely the path you actually followed and to correct the related map.

Loop closure detection relies on this idea. Without making any assumption about the path followed by the robot, and simply by keeping the history of past frames, it is possible to check whether the current frame matches some of the previous ones. If the current observation shares enough similarities with an observation seen in the past, a transformation can be computed between these frames and a new constraint can be derived from it. This can be repeated several times. With these new constraints, the accumulated error of the graph can be considerably reduced. In theory, all the poses could be linked together this way, provided such relations exist.
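The naive form of this search can be sketched as follows. This is a minimal illustration, not the implementation used in this work: \texttt{match\_frames} is a hypothetical callable standing in for the feature matching and transformation estimation described in chapter~\ref{chap:features}.

```python
def detect_loop_closures(frames, match_frames, min_inliers=30):
    """Compare each frame with all earlier frames and collect the pairs
    whose matching yields enough inliers.  `match_frames` is a hypothetical
    helper returning {"inliers": n, "transform": T} or None."""
    constraints = []
    for i in range(len(frames)):
        for j in range(i):  # only look into the past
            result = match_frames(frames[j], frames[i])
            if result is not None and result["inliers"] >= min_inliers:
                # a valid transform between frames j and i becomes a new edge
                constraints.append((j, i, result["transform"]))
    return constraints
```

Each returned triple corresponds to one additional graph constraint between two poses.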

\begin{figure}[H]
\centering$
\begin{array}{cc}
\subfloat[]{\label{fig:lc1} \includegraphics[width=0.4\textwidth]{figures/graph_lc1}} &
\subfloat[]{\label{fig:lc2} \includegraphics[width=0.4\textwidth]{figures/graph_lc2}}
\end{array}$
\caption{Loop closure detection. \protect\subref{fig:lc1}~The poses are linked together by standard edges, each representing a transformation. \protect\subref{fig:lc2}~The observation of a common point of interest already seen in the past, if significant, can trigger the creation of a new link between two poses that were previously separated in the whole sequence.}
\end{figure}

\clearpage
\subsection{Graph optimization}

The detection of loop closures leads to new constraints that can be inserted into the graph as new edges, linking the related poses. The optimization of the graph consists in finding the optimal configuration of vertices to respect all the given constraints, as illustrated in figure~\ref{fig:graph_overview}.

\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{figures/graph_overview}
\caption{Overview of the graph optimization procedure}
\label{fig:graph_overview}
\end{figure}

\clearpage
\section{Summary}

This chapter gives an overview of some of the main concepts used for \gls{SLAM}, and more specifically \gls{VSLAM}, defining the outlines of a possible approach to solve the problem. Features allow tracking points of interest and can be used to evaluate a unitary move between a pair of frames. By combining these moves, the trajectory of the camera can be estimated. With the use of a pose graph, and after detecting loop closures, the graph can be optimized in order to reduce the global drift error. From the corrected camera poses, a map can then be built.

While performance is not the main issue of this work, it is still possible to distinguish between the operations that can be performed \emph{online} (while the robot is moving) and those done \emph{offline} (after a given sequence of moves). While \gls{SLAM} addresses both localization and mapping, the graph-based solution relies on the detection of loop closures to give a correct result. Therefore, the dual problem of building the map and localizing the robot at the same time is more likely to be done \emph{offline}. This is one of the main constraints resulting from this approach. However, once the map is built, the single operation of localization could be performed on the basis of the relevant features used during the initial \gls{SLAM} operation. Such an approach would allow the robot to build a map as a first step, dedicated to this purpose alone. Once the map is built, localization and further operations could be carried out, depending on the tasks the robot has to fulfill.

\chapter{Feature matching}
\label{chap:features}

In the context of this work, the goal is to track some keypoints precisely and to evaluate the motion of the camera from the keypoint coordinates. In this chapter, we present how the extraction of features is performed and how the keypoints are matched, in order to compute a rigid transformation between a couple of frames.

\section{Feature extraction}

The first question is the choice of the features, starting from those listed in table~\ref{tab:features}. As performance is not an issue yet, we will rather consider the well-known \gls{SIFT} and \gls{SURF}, for which many implementations exist, but also the more recent \gls{NARF}~\cite{steder10irosws}. The \gls{BRIEF} descriptor can be introduced later if necessary, assuming the detector is already given. One important difference is that \gls{SIFT} and \gls{SURF} are computed from the RGB data in 2D, while the \gls{NARF} keypoints are computed from the 3D point cloud, involving both RGB and depth data. Before looking at scale and rotation invariance, one important condition that can easily be verified is the stability of the keypoints when the camera does not move.

We first tested the \gls{NARF} descriptor provided by the \gls{PCL}. The keypoints proved to be unstable while the camera was not moving. These variations could be explained by the noise in the depth data of the Kinect, affecting the \gls{NARF} detector. A better tuning may give more stable results, but this feature was apparently designed for range scans such as laser range finders or stereo cameras, which are generally more accurate, or less sensitive to noise, than the Kinect sensor. For this reason, we preferred to work with the 2D image, first using the \gls{SIFT} features, which are widely used in the community and known to be robust. Here, we tested the library developed by Rob Hess~\cite{hess_sift}. In order to keep only features that are useful in the next steps of the processing, features are removed if the depth data at their location is not available (occlusion) or if it lies outside a fixed depth range. Here, a maximum range of 5 meters is used.

\clearpage
\section{Comparison of SIFT and SURF}
After testing simple extractions and getting satisfying results with the \gls{SIFT} features, the same experiments were repeated using the \gls{SURF} features. The implementation provided in OpenCV\footnote{\url{http://opencv.willowgarage.com/}} was used. We noticed a considerable gain in performance for a similar number of features. Globally, for a scene requiring about 1000ms with \gls{SIFT}, the extraction of the \gls{SURF} features takes about 250ms, meaning that the time is reduced by a significant factor of 4.

Both algorithms have various parameters that will result in more or fewer features, at the cost of computational time. In particular, for \gls{SIFT} these are the number of scales $k$ computed for each octave (see figure~\ref{fig:sift_dog}) and the variance $\sigma$ of the Gaussian. Here, the default values of the \gls{SIFT} library~\cite{hess_sift} are used, which are also the standard values recommended by David Lowe ($k=3$, $\sigma=1.6$). Refer to~\cite{lowe_2004_sift} for the description of these parameters. For \gls{SURF}, the relevant parameter is the threshold on the Hessian: only features with a Hessian response larger than this value are extracted.

In order to compare both methods, we first tried to find a configuration of parameters giving a similar number of features, using a dataset of 6474 frames described in chapter~\ref{chap:experiments} (see also the description of the hardware and software configuration). For an image of 640x480 pixels, the number of octaves set in the SIFT library is 6. By setting the number of octaves to 6 and the Hessian threshold to 300 for \gls{SURF}, we get a very close number of features. We then compare with the default settings, which imply 3 octaves and a Hessian threshold of 500. The results are summarized in table~\ref{tab:stats_features}.

\begin{table}[h]
 \begin{center}
 \begin{tabular}{r|ccc}
 & Avg number of features & Avg time & Avg time per feature \\
 &  & (in ms) & (in ms) \\
 \hline
 SIFT 6 & 745 & 894 & 1.407\\
 SURF 6/300 & 711  & 345  & 0.582 \\
 SURF 3/500 & 384 & 178 & 0.586 \\
 \end{tabular}
\caption{Comparison of SIFT and SURF features, averaged over 6474 frames. For a similar number of features, SURF is almost 3 times faster than SIFT. For half the number of features, there is a gain of a factor of 5.}
\label{tab:stats_features}
\end{center}
\end{table}

\clearpage

\begin{figure}[H]
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/stats_features_time}
\caption{Comparison of SIFT and SURF -- time for computing features. SURF is about 2.6 times faster than SIFT for a similar number of features. For different SURF parameters, the time is linearly proportional to the number of features (twice as fast for half the features).}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/stats_features_tbf}
\caption{Comparison of SIFT and SURF -- average time per feature. SURF is the fastest, and the time per feature does not depend on the given parameters.}
\end{center}
\end{figure}

\clearpage
\section{Initial Matching}

Now that the features can be computed, it is possible to look for matches between a pair of frames. The initial matches can be found through a nearest neighbor search, using the Euclidean distance computed on the descriptors of these features.
The SIFT library~\cite{hess_sift} provides an implementation of the Best-Bin-First search described by Lowe~\cite{lowe_2004_sift}, based on a kd-tree. For each feature in the target frame, the two nearest neighbors are searched for in the source frame. If these neighbors are close enough (with respect to a fixed threshold), the result is considered valid. This initial search is done on the 2D features. To increase robustness, matches are rejected for those keypoints where the ratio of the nearest neighbor distance to the second nearest neighbor distance is greater than a given threshold, set here to 0.5. After this, the depth information is used to keep only the features close enough to the camera, as the accuracy of the depth information is not considered reliable beyond a given range (around 5-6 meters).
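The ratio test above can be sketched with a brute-force nearest-neighbor search over descriptor arrays. This is only an illustration of the criterion; the actual implementation relies on the kd-tree of the SIFT library, and the descriptor values below are arbitrary.

```python
import numpy as np

def ratio_test_match(desc_src, desc_dst, ratio=0.5):
    """Match each target descriptor to its nearest source descriptor,
    keeping the match only if the nearest distance is below `ratio` times
    the second-nearest distance (Lowe's ratio test).
    Returns a list of (dst_index, src_index) pairs."""
    matches = []
    for i, d in enumerate(desc_dst):
        dists = np.linalg.norm(desc_src - d, axis=1)
        nn = np.argsort(dists)[:2]          # the two nearest neighbors
        if dists[nn[0]] < ratio * dists[nn[1]]:
            matches.append((i, int(nn[0])))
    return matches
```

A kd-tree replaces the linear scan over `desc_src` with a logarithmic search, which matters for hundreds of 128-dimensional descriptors per frame.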

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{pictures/sift_matching_init}
\caption{Initial matching of SIFT features}
\end{figure}

\section{Estimation of the 3D transformation}
\label{sec:transformation}

From the matching pairs, it is possible to find a rigid transformation binding the two sets of points, i.e. the operation that projects each feature point of a source frame to the corresponding point in the target frame. In our case, this transformation has 6 degrees of freedom, composed of a rotation matrix and a translation vector in 3 dimensions. It can be written as a 4x4 matrix using homogeneous coordinates. We can then project a point $P = (x,y,z,1)^T$ simply by applying the transformation matrix to the point:

\[
P' = T_{transformation} \: P
\]

The points defined by the initial matching pairs need to be converted into 3D coordinates. For this, we need to define a proper coordinate system. To make the next steps easier (graph optimization and scene reconstruction) and avoid further conversions, we keep the same coordinate system throughout this work, defined as follows:

\[
\left\{\begin{array}{l}
x: depth,\:positive\:forward\\
y: height,\:positive\:upwards\\
z: width,\:positive\:to\:the\:right\\
\end{array}
\right.
\]

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{figures/coordinates}
\caption{The two different coordinate systems, from the "screen" (as seen by the RGB-D sensor) to the 3D scene in the real world.}
\end{figure}

From the depth value (given in mm by OpenNI) and the screen coordinates (with RGB color and a resolution of 640x480 pixels), it is possible to compute the point coordinates in 3D. Let $(u,v)$ be the point coordinates in pixels; we then have:

\[
P(x,y,z)\left\{
\begin{array}{l}
x = depth \\
y = -(v - 480/2) * depth / focal\_length \\
z = (u - 640/2) * depth / focal\_length \\
\end{array}
\right.
\]
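The conversion above translates directly into code. A minimal sketch follows; the focal length of 525 pixels is only a placeholder value commonly quoted for the Kinect, not a calibrated figure from this work.

```python
def pixel_to_3d(u, v, depth, focal_length=525.0, width=640, height=480):
    """Convert screen coordinates (u, v) in pixels and a depth value into
    the coordinate system used here: x is depth (positive forward),
    y is height (positive upwards), z is width (positive to the right)."""
    x = depth
    y = -(v - height / 2) * depth / focal_length
    z = (u - width / 2) * depth / focal_length
    return (x, y, z)
```

A point at the image center maps onto the x axis, at a distance equal to its depth value.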

\paragraph{}
Considering the initial matches, the next step is then to find a transformation matrix, that gives a satisfying projection for \emph{most} of these points. Ideally, we could simply compute a transformation by a least square method for all these points, each matching pair defining an equation. The minimum number of pairs required for this would be three, and the more points, the better defined the system would be. However, because of the uncertainty due to the sensory noise and the feature matching by itself, a single point which location is incorrectly estimated could lead to a significant error in the final transformation. Therefore, a better transformation can be found iteratively with the \gls{RANSAC} method described in the background~(\ref{sub:ransac}). In this context, we can precise the algorithm:

\begin{algorithm}[H]
\caption{Find the 3D transformation with RANSAC}
\begin{algorithmic}
\REQUIRE initial pairs of 3D points (origin, destination)
\STATE $bestTransform,\:bestInliers \gets \emptyset$
\STATE Define the number of iterations $N$
\FOR{$iteration=1$ to $N$}
 \STATE $samples \gets$ randomly pick $k$ pairs (origin, destination)
 \STATE Compute $currentTransform$ from $samples$
 \STATE $inliers \gets \emptyset$
 \FORALL{pairs of points}
  \STATE $projected_i \gets$ projectPoint($origin_i$, $currentTransform$)
  \STATE $error \gets$ computeDistance($projected_i - destination_i$)
  \IF{$error$ < $threshold$}
   \STATE{$inliers \gets inliers + pair_i$}
  \ENDIF
 \ENDFOR
 \STATE Count number of inliers and compute mean error
 \IF{$currentTransform$ is valid}
  \STATE Recompute $currentTransform$ from $inliers$
  \IF{$currentTransform$ better than $bestTransform$}
   \STATE $bestTransform \gets currentTransform$
   \STATE $bestInliers \gets inliers$
  \ENDIF
 \ENDIF
\ENDFOR
\RETURN $bestTransform$, $bestInliers$
\end{algorithmic}
\end{algorithm}

As mentioned in section~\ref{sub:ransac}, it is possible to estimate the optimal value of~$N$ to satisfy a target probability for the inlier ratio. In this work, we use a fixed value of $N=20$, set by a parameter. It could be tuned more accurately with a deeper study of the \gls{RANSAC} procedure. Generally, if the number of features is high, the probability of having an outlier at this step is lower, as most of the mismatches have already been excluded by the initial matching.

%If we define that the mean number of features is 500 with 10 outliers (after different experiments) then we have:
%\[
%v=10/500=1/50
%N = \frac{log(1-p)}{log(1-(1-v)^k)} = \frac{log(1-0.99)}{log(1-(1-\frac{1}{50})^3)} = 
%\]

To compute a transformation (a hypothesis), only $k$ samples from the initial matches are used each time. The 3D points are converted to homogeneous coordinates and the transformation is obtained by solving the corresponding equation from the constraints defined by the chosen points. This can be done through a \gls{SVD} of the covariance of the data. Here, we use the minimal number of points required to solve this equation, $k=3$.

To evaluate a transformation, each point taken from the initial matches (source) is projected according to this transformation. A 3D vector is computed as the difference between the projected point and the corresponding matching point (destination). The error is the norm of this vector.
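The two building blocks above, the SVD-based fit and the RANSAC loop, can be sketched as follows. This is a simplified illustration under stated assumptions (fixed thresholds, "best" defined as the largest inlier set), not the exact implementation used in this work.

```python
import numpy as np

def rigid_transform_svd(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, via an
    SVD of the covariance of the centered point sets."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)     # covariance of the data
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                # reject reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t

def ransac_transform(src, dst, n_iter=20, k=3, threshold=0.05, rng=None):
    """RANSAC loop: sample k pairs, fit a transform, collect the inliers,
    refit on the inliers, and keep the configuration with most inliers."""
    rng = rng or np.random.default_rng(0)
    best = (None, [])
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=k, replace=False)
        R, t = rigid_transform_svd(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = np.flatnonzero(err < threshold)
        if len(inliers) >= k:
            R, t = rigid_transform_svd(src[inliers], dst[inliers])
            if len(inliers) > len(best[1]):
                best = ((R, t), inliers)
    return best
```

The error of each pair is exactly the norm of the difference between the projected source point and its destination, as described above.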

\section{Analysis}

Some preliminary experiments were conducted to determine how to find a satisfying transformation between a pair of frames. In this context, the unitary move between two frames should generally be small if we consider a limited motion of the camera (in amplitude and speed); the larger the move, the more difficult the matching becomes. To study this, a program was written that performs the feature extraction, the initial matching and the RANSAC iterations, taking a pair of RGB-D frames as input.

\begin{figure}[H]
\centering$
 \begin{array}{cc}
 \subfloat[Initial matching]{\label{fig:sift_match_1} \includegraphics[width=0.35\textwidth]{pictures/sift_matching_init}} &
 \subfloat[After RANSAC]{\label{fig:sift_match_2} \includegraphics[width=0.35\textwidth]{pictures/sift_matching_ransac}}
 \end{array}$
\caption{Matching of SIFT features: \protect\subref{fig:sift_match_1}~initial matching from KNN search \protect\subref{fig:sift_match_2}~after running RANSAC. Some of the initial matches drawn in diagonal (green) in the left figure are excluded and become outliers (red) after RANSAC.}
\end{figure}

\clearpage
\paragraph{\emph{Characterizing a transformation}}

A transformation can be characterized by at least 3 values:
\begin{itemize}
\item Mean error: the mean norm of the error vectors. Lower is better.
\item Number of inliers: the absolute number of inliers. Higher is better.
\item Ratio of inliers: the relative number of inliers with respect to the initial matches. Higher is better.
\end{itemize}

The main difficulty is to find the best balance among these 3 values by setting appropriate thresholds. Setting the constraints too high would make it impossible to find a transformation satisfying all the criteria, especially when not enough features are detected. Setting them too low helps find a transformation even with fewer features or noisy measurements, but the resulting transformation will be less precise. The main risk is that constraints that are too loose let mismatches be counted as inliers. If the criteria are too permissive, this results in an invalid model, leading to incorrect associations, as shown in figure~\ref{fig:bad_inliers}.

\begin{figure}[H]
\centering$
\includegraphics[width=0.5\textwidth]{pictures/bad_inliers1}$
\caption{Example of an incorrect model. The matches are incorrect, as the scenes and objects are completely different, but the green lines are considered inliers due to the similarity of the dark lines. Here, the threshold defining the minimum ratio of inliers with respect to the initial matches is too low.}
\label{fig:bad_inliers}
\end{figure}

\paragraph{\emph{Quality of a transformation}}

A good transformation should be valid for most of the given points. But this definition is not sufficient, as the number of inliers alone is not necessarily the best criterion. Another criterion could be the spread of the inliers: if many inliers are concentrated in a small area, they do not add much value with respect to the global transformation of the scene. A better transformation may be found by including more distant points, even if their projection error is slightly above the threshold. This spread could be measured by computing the mean position of the inliers and, from it, the standard deviation, or more simply the variance, of the inliers.

Let N be the number of inliers and $p_i$ be the i-th inlier vector. We can then compute the mean vector $\mu$ and the variance $\sigma^2$ with their standard definitions:
\[
\mu = \frac{1}{N} \sum_{i=1}^N{p_i}
\;\;\;\;\;\;\;\;
\sigma^2 = \frac{1}{N} \sum_{i=1}^N{(p_i - \mu)^2}
\]
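These two quantities are straightforward to compute; a minimal sketch over 2D or 3D inlier positions:

```python
import numpy as np

def inlier_spread(points):
    """Mean vector and scalar variance of the inlier positions, used as a
    measure of how spread out the inliers are over the scene."""
    p = np.asarray(points, dtype=float)
    mu = p.mean(axis=0)
    # variance: mean squared Euclidean distance of the points to the mean
    var = np.mean(np.sum((p - mu) ** 2, axis=1))
    return mu, var
```

A low variance flags configurations like the middle case of figure~\ref{fig:bad_transform}, where the inliers cluster in a small region.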

\begin{figure}[H]
\centering$
 \begin{array}{ccc}
 \includegraphics[width=0.33\textwidth]{pictures/bad_transform1} &
 \includegraphics[width=0.33\textwidth]{pictures/bad_transform2} &
 \includegraphics[width=0.33\textwidth]{pictures/bad_transform3}
 \end{array}$
\caption{Sequence showing different distributions of the inliers. Note how the inliers are grouped in the case shown in the middle.}
\label{fig:bad_transform}
\end{figure}

\begin{table}[h]
\begin{center}
\begin{tabular}{ccll}
 inliers/matches & ratio & $\sigma^2(2D)$ & $\sigma^2(3D)$\\
 \hline
 93/114 &	81\% &	0.345409 &	1.2232\\
 34/119 &	28\% &	0.0203726 &	0.0519737\\
 91/135 &	67\% &	0.476902 &	1.42809\\
 %90/140 &	64\% &	0.682466 &	1.56629\\
\end{tabular}
\end{center}
\caption{Ratio of inliers shown in figure~\ref{fig:bad_transform}, variance of the keypoints in 2D (without depth information) and 3D.}
\end{table}

The variance may be used to detect these situations. Instead of using the ratio of inliers as the criterion to measure the quality of the model, the variance could be used, not only as an evaluation of the goodness of fit (``score'') but also as a criterion for the selection of the $k$ elements of each sample. For example, the initially selected elements would be required to lie beyond a minimal distance from the mean, set by a threshold. This concept could not be developed in the given time, but it could be studied in future work.

\section{Summary}

In this chapter, we presented a method to compute a rigid transformation between a pair of frames given their RGB-D data. This transformation represents the motion of the camera between two consecutive frames, with 6 degrees of freedom. This was done through the following steps:

\begin{enumerate}
\item first, the \gls{SIFT}/\gls{SURF} features are extracted from each RGB frame (using only 2D);
\item an initial matching is performed with the use of a kd-tree, and the depth information is integrated to compute the feature positions in 3D;
\item from this set of pairs of features, a transformation is computed by running a \gls{RANSAC} algorithm.
\end{enumerate}

Figure~\ref{fig:system_overview_rgb_depth} illustrates how the RGB and depth input data can be used to compute the sequence of 3D transformations and, subsequently, the initial estimation of the poses.

\begin{figure}[H]
\centering
\includegraphics[width=1.0\textwidth]{figures/overview_rgb_depth}
\caption{Processing of the RGB-D data}
\label{fig:system_overview_rgb_depth}
\end{figure}

\chapter{Building a map}
\label{chap:map}

In the previous chapter, we saw how to determine a rigid transformation between a pair of frames. From each single transformation, an estimation of the camera pose can be computed. These transformations can now be combined to process a whole sequence of frames. Each pose is inserted into the graph by converting its representation from the camera matrix into a graph node. From the loop closures, new constraints can be inserted in the graph. The graph can then be optimized to minimize an error function, and the poses are updated according to the new vertices given by the optimized graph, reducing the global drift.

\section{Estimating the poses}

Knowing the initial pose, the first step is to determine an estimation of any pose after a succession of transformations. Considering a finite sequence of frames $[frame_0 ; frame_N]$, let~$P_k$ be the pose at rank $k \in [0;N]$. For 6~\gls{DOF}, composed of 3 axes of rotation and 3 of translation, it can be represented by a 4x4 matrix with homogeneous coordinates, where $R$ is a 3x3 rotation matrix and $t$ a translation vector:

$$
P_k = \left[ \begin{array}{cccc}
 \multicolumn{3}{c}{\multirow{3}{*}{$R$}} & t_x \\
 & & & t_y \\
 & & & t_z \\
0 & 0 & 0 & 1 \end{array} \right] 
$$

For~$i>0$, we have the transformation~$T_{i-1}^{i}$ that binds the position~$P_{i-1}$ to the position~$P_i$. Each transformation $T$ can also be represented by a similar 4x4 matrix. If~$P_0$ is the initial position, we can then compute the position~$P_k$ by combining all the transformations:

\begin{equation}
P_k = \prod_{i=k}^{1}{T_{i-1}^{i}} \: P_0
\label{eqn:pose_estimation}
\end{equation}

As the matrix product is not commutative, it is essential to follow the correct order when multiplying the matrices. For example, for the pose~$P_3$ we have:

\[
P_3 = T_2^3 \: T_1^2 \: T_0^1 \: P_0
\]

Generally, the initial position is defined by the initial orientation of the camera for each axis, stored in the initial rotation matrix $R_0$, and the initial translation vector $t_0 = (x_0, y_0, z_0)^T$, as follows:

\[
P_0 = \left[ \begin{array}{cccc}
 \multicolumn{3}{c}{\multirow{3}{*}{$R_0$}} & x_0 \\
 & & & y_0 \\
 & & & z_0 \\
0 & 0 & 0 & 1 \end{array} \right] 
\]

As an arbitrary choice, we can define the initial position to be at the origin of the coordinate system, which is given by the identity matrix~$I_4$.

\[
P_0 = I_4 = \left[ \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \end{array} \right] 
\]

\section{Initializing the graph}

Equation~\ref{eqn:pose_estimation} gives an estimation of the pose that can be inserted into the $g^2o$ graph. However, a sequence of video frames can contain a large number of similar poses, especially if the robot slows down, or even stops. Inserting a node for each transformation would lead to many redundant poses. It makes sense to insert a new pose only after a significant change from the previous one. To distinguish them among all the poses that could potentially be inserted in the graph, these poses are called \emph{keyposes}.

A simple solution is to define a keypose only after a move of a given amplitude, defined by a threshold, has been achieved. By accumulating the unitary moves between each pair of frames in the sequence, the creation of a new keypose is triggered once a given distance in translation or a given angle in rotation has been reached with respect to the last keypose. The thresholds are set by parameters, here 0.1m for the distance and 15\textdegree{} for the rotation (both values consider the variations on the 3 axes cumulated together). Another solution would be to compare the points of interest between the frames and trigger the creation of a new keypose when the number of points in common falls below a given threshold, but it is not guaranteed that the absolute number of points is representative enough to describe the amplitude of the move.
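The accumulation rule can be sketched as follows; the per-frame moves are given here as precomputed (distance, angle) pairs, which is a simplification of the matrix-based bookkeeping actually performed.

```python
def select_keyposes(moves, d_thresh=0.1, a_thresh=15.0):
    """Given per-frame unitary moves as (distance in m, angle in degrees)
    pairs, accumulate them and emit the index of a new keypose each time
    the cumulated translation or rotation exceeds its threshold."""
    keyposes = [0]                 # the first frame is always a keypose
    dist = angle = 0.0
    for i, (d, a) in enumerate(moves, start=1):
        dist += d
        angle += a
        if dist >= d_thresh or angle >= a_thresh:
            keyposes.append(i)
            dist = angle = 0.0     # restart relative to the new keypose
    return keyposes
```

The counters are reset at each new keypose, so the thresholds are always measured with respect to the last keypose, as described above.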

For each keypose, a node is inserted in the graph, with an edge linking the new keypose to the previous one. A standard edge represents the rigid transformation between the two keyposes. Between two consecutive keyposes, there may be more than one pair of frames; the corresponding transformation is then found by multiplying all the intermediate matrices for each pair of frames.

\section{Loop closures}

The detection of loop closures can be translated into additional constraints in the graph, as illustrated in figure~\ref{fig:graph_overview}. Each time a loop closure is triggered between two keyposes, a new edge is inserted into the graph. Here, the detection of the loop closure is done by computing a transformation between the two frames, as described in section \ref{sec:transformation}. If there are enough inliers to make the transformation valid, a loop closure can be inserted.

Following the order of the sequence, the detection of a loop closure can be done by comparing the most recent frames with previous ones. The naive approach would be to check against the whole set of frames, comparing the current frame with all the previous ones. However, this can be highly time consuming, as the number of checks grows quickly. For example, for 4 frames, there would be 6 checks to perform by looking in the past (respectively 1, 2 and 3~checks for each frame). For $N$ frames, there would be $(N-1)*N/2$ checks. Clearly, this can become an issue if a single check is not fast enough. First, it can be optimized by storing the past features in a memory buffer. A preliminary check done exclusively on the RGB data, for example with color histograms, could also be used to discard most of the negative candidates. A more accurate verification involving the features would then be applied to the remaining candidates.

In this work, for performance reasons, it was necessary to reduce the list of candidates to check. Without making any assumption about the past frames, the first idea is to define a \emph{sliding window} with a fixed size of $k$ frames in the past. A loop closure is of interest when it closes a loop with frames that are sufficiently distant in the temporal sequence, as the goal is to reduce the drift over time. For this reason, from a given frame, it is more appropriate to check against older frames rather than the most recent ones. By defining a number of excluded frames and the size of the sliding window, an efficient method can be implemented.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \subfloat[]{\label{fig:lc_sw_1}\includegraphics[width=0.8\textwidth]{figures/graph_lc_sw1}} \\
 \subfloat[]{\label{fig:lc_sw_2}\includegraphics[width=0.8\textwidth]{figures/graph_lc_sw2}}
 \end{array}$
\caption{Illustration of a sliding window of 5 frames, where the last 3 frames are ignored. \protect{\subref{fig:lc_sw_1}} For the current frame n.9, the frames 1-5 are checked. \protect{\subref{fig:lc_sw_2}} For the next frame n.10, the window slides one step forward. The frames 2-6 are checked.}
\end{figure}

However, these parameters are hard to define without prior knowledge of the path followed. To extend the search while still limiting the number of checks, the frames could be selected randomly in a much larger window, first assuming a constant probability density function. But the goal is to select the past frames that are more likely to match the current frame, giving a higher priority to the oldest frames. This can be done by defining a probability density function with a higher probability for the oldest frames. This probability could also be defined using knowledge of the context, in terms of features or estimated position, preselecting the candidates with the highest likelihood.

Here, we limited the search by using a candidate list of $N$ frames, ignoring the most recent frames (the size can be set by a parameter), and a probability density function giving priority to the oldest frames. The probability of selecting a candidate can be defined by $p(i) = i/S$, where $S=(N+1)*N/2$ and the index $i$ runs from $N$ for the oldest candidate down to 1 for the most recent. For a window of 3 frames, the oldest frame would have probability 3/6, the second 2/6 and the third 1/6; by construction, these values sum to a total probability of 1.
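This weighted selection can be sketched with the standard library; the triangular weights $N, N-1, \dots, 1$ implement $p(i)=i/S$ without an explicit normalization, since `random.choices` normalizes internally.

```python
import random

def pick_candidate(window):
    """Pick one candidate frame from the sliding window, where the oldest
    frame (first in the list) has the highest probability: for N frames,
    the weights are N, N-1, ..., 1, i.e. p(i) = i / (N(N+1)/2)."""
    n = len(window)
    weights = list(range(n, 0, -1))   # oldest frame gets weight N
    return random.choices(window, weights=weights, k=1)[0]
```

Over many draws, the oldest frame of a 3-frame window is selected about half of the time, matching the 3/6, 2/6, 1/6 distribution above.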

To improve the quality of the loop closure detection, once a candidate frame is found, the check is extended to its neighbors, looking for a better transformation. The ratio of inliers is used to measure and compare the quality of the transformations. We consider only a forward search, meaning that only the following neighbors are checked, as illustrated in figure~\ref{fig:lc_candidate}.
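The forward search can be sketched as a simple hill-climb over the inlier ratio. `quality` is a hypothetical callable standing in for the full transformation check against the current frame.

```python
def forward_search(candidate, quality, max_index):
    """Starting from a valid candidate frame index, check the following
    neighbors and keep moving forward while the transformation quality
    (inlier ratio, returned by the hypothetical `quality` callable)
    improves; stop at the first decrease."""
    best, best_q = candidate, quality(candidate)
    i = candidate + 1
    while i <= max_index:
        q = quality(i)
        if q <= best_q:
            break              # quality dropped: stop the forward search
        best, best_q = i, q
        i += 1
    return best
```

In the scenario of figure~\ref{fig:lc_candidate}, the search moves from C1 through C2 to C3 and stops at C4, whose quality is lower.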

\begin{figure}[H]
\centering
 \includegraphics[width=0.8\textwidth]{figures/graph_lc_candidate}
\caption{Loop closure with forward search for the pose $P_k$. First, a valid transformation is found with candidate C1. The check is then performed with the next frame C2, giving a better transformation. The search continues forward, and finds a better transformation with C3. Finally, the quality of the transformation with C4 is lower, so the process stops. The loop closure is inserted between $P_k$ and C3.}
\label{fig:lc_candidate}
\end{figure}

\clearpage
\section{Optimizing the graph}

Once the graph has been initialized with the poses and the constraints from the loop closures, it can be optimized. Various methods are available in $g^2o$, both for the optimization algorithm and the underlying linear solver. The method used here is Levenberg-Marquardt with the CHOLMOD linear solver. The optimization is run for a predefined number of steps. Once the graph has been optimized, the vertices are corrected and the new estimations of the camera poses can be extracted. Similarly to the insertion, the inverse operation is performed to convert the information from the graph vertices back to 4x4 matrices.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \subfloat[initial graph]{\label{fig:g2o_view_1}\includegraphics[width=0.5\textwidth]{pictures/graph4_base}} \\
 \subfloat[optimized graph]{\label{fig:g2o_view_2}\includegraphics[width=0.5\textwidth]{pictures/graph4_optimized}}
 \end{array}$
\caption{Visualization of the graph with the g2oviewer, \protect{\subref{fig:g2o_view_1}}~non-optimized, \protect{\subref{fig:g2o_view_2}}~after 10 iterations, for a map of 4 rooms (experiments are described in chapter \protect{\ref{chap:experiments}}).}
\end{figure}
\clearpage

\section{Scene reconstruction}

Finally, once the camera poses are determined with good confidence, the reconstruction of the scene can be performed. For each frame, a 3D point cloud can be generated from the RGB and depth data, providing the colour information for each point. However, in order to reconstruct the whole scene, the clouds have to be combined together from a unique point of view. This process is called point cloud \emph{registration}. Once the camera positions are known, each point cloud is transformed by projecting all its points according to the corresponding camera pose, given by equation~\ref{eqn:pose_estimation}. This is done relative to the first pose, which is predefined. The transformed point clouds are then concatenated together to build the scene.
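As a minimal sketch of this registration step (Python/NumPy with illustrative names; the actual implementation relies on \gls{PCL}):

```python
import numpy as np

def register_clouds(clouds, poses):
    """Project each point cloud (an n x 3 array) with its camera pose
    (a 4x4 matrix, relative to the predefined first pose) and
    concatenate the results into a single scene cloud."""
    parts = []
    for points, pose in zip(clouds, poses):
        # lift to homogeneous coordinates, apply the pose, drop back to 3D
        homogeneous = np.hstack([points, np.ones((len(points), 1))])
        parts.append((homogeneous @ pose.T)[:, :3])
    return np.vstack(parts)
```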

This method is simple, but it has two major issues: it leads to duplicate points, and it does not take variations in illumination into account. Some considerations about memory are described in the experiments, in section~\ref{sec:data_output}. Moreover, the point cloud is not necessarily the best representation of the scene. For many applications, it would be preferable to define dense surfaces, such as the \emph{surfels} presented in \cite{PfisterZBG00} and used in the work of Henry~\cite{Henry_RGBD_2010}, but this would require further study and processing and is beyond the scope of this work.

\section{Summary}

In this chapter, we presented how the poses can be estimated, first by accumulating the transformations between pairs of frames. Through the use of a pose graph and loop closures, the poses can be corrected over time. Finally, the map can be built by scene reconstruction, registering the point clouds according to the estimated camera positions.

\chapter{Experiments}
\label{chap:experiments}

After studying the feature extraction and matching, the graph optimization and the reconstruction, some experiments could be carried out. A software application was developed, whose goal is to acquire data from the Kinect, compute the 6\gls{DOF} camera poses representing the trajectory and orientation of the robot, and produce a 3D map. The following sections describe how the input data was acquired and how the output is generated. The next sections present the maps obtained from different datasets, first from KTH, for different sizes (1 room, then 2 and 4 rooms with a corridor). The last section shows the results obtained with data from other universities.

\section{System overview}

The program can perform the following tasks:

\begin{itemize}
\item acquire a stream of RGB-D data from the Kinect;
\item perform extraction and matching of SIFT/SURF features;
\item compute the relative transformations with \gls{RANSAC};
\item compute the initial poses and translate them into $g^2o$ nodes and edges;
\item detect the loop closures and insert the corresponding edges into the graph;
\item optimize the graph with $g^2o$ and extract the updated camera poses;
\item reconstruct the global scene by generating a point cloud datafile (*.pcd).
\end{itemize}

The sequence of actions, which have been described in the previous chapters, is summarized in figure~\ref{fig:system_overview}. Note: the use of RGB and depth data is detailed further in figure~\ref{fig:system_overview_rgb_depth}.

\clearpage
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1\textwidth]{figures/overview}
\caption{Overview of the system}
\label{fig:system_overview}
\end{center}
\end{figure}

In order to repeat the experiments with different parameters, the RGB-D stream is saved and the program is able to replay a given sequence from a list of files. Two types of experiments were conducted:
\begin{itemize}
\item on the fly, by moving the Kinect sensor by hand, with a feedback in real time;
\item offline, by recording a data stream with a mobile robot, and processing it afterwards. 
\end{itemize}

In the first case, the program performs the feature extraction and computes the transformations in real time. When the motion is too fast, or when the number of features is too low, the synchronization can be lost if no valid transformation is found between the last two frames. The sequence is then suspended and a message is displayed, so that the user can move the camera back and lock onto the last valid frame again to resume the sequence. The detection of the loop closures, the graph optimization and the reconstruction are done in a subsequent step. All the processing could have been done on the robot as well, but this would have required further setup, and here the goal was to generate a set of reference data, as described in section~\ref{sec:data_acquisition}.

The main program was executed on a PC equipped with an Intel~Core\texttrademark{}~\hbox{i3--540}~CPU (3.06~GHz, 4~MB~cache) and 3~GB of RAM, running Linux Ubuntu~10.04 (kernel~2.6.35.7). When the acquisition and matching are done on the fly, the system is able to process about 3--4~frames per second using \gls{SURF}, which is already enough to let the user walk slowly in a room with real-time feedback. Most of the time is spent on the feature extraction (100--200~ms). Some possible improvements are given in the final chapter with the conclusions.

\clearpage
\section{Data acquisition}
\label{sec:data_acquisition}

To build some reusable datasets, the experiments were carried out at \gls{CVAP} (KTH) on a Pioneer III wheeled robot, Dora the explorer (see figure~\ref{fig:dora}). The Kinect camera is mounted 1.4~m above the floor, and the robot is also equipped with a Hokuyo URG laser scanner. Each frame saved by the Kinect leads to a pair of RGB and depth files taken at the same time. Additionally, the data from the laser scans and the odometry were saved, and then used to build a reference path, as shown in figure~\ref{fig:plan_cvap}. The acquisition of the raw data by the robot is not part of the main program built in this work, but relies on tools that were developed at \gls{CVAP}. This implies that the format of the RGB-D data follows a common specification.

\begin{figure}[h]
 \begin{center}
 \includegraphics[width=0.5\textwidth]{pictures/Dora_7587}
 \end{center}
\caption{Dora, the explorer}
\label{fig:dora}
\end{figure}

At KTH, the main dataset was recorded in an environment with 4 different rooms connected by a corridor (see map in figure~\ref{fig:plan_cvap}, page~\pageref{fig:plan_cvap}), resulting in 6474 frames (RGB-D datafiles) with several possibilities of loop closures. Other universities participating in common projects could provide data in the same format with similar robots. This made it possible to test the system with data provided by the universities of Birmingham and Ljubljana.

\begin{table}[H]
 \begin{center}
  \begin{tabular}{llcc}
  \hline
  University & Dataset & Rooms & Frames \\
  \hline
  KTH & 7th floor & 1 & 3082 \\
  KTH & 7th floor & 2 & 2697 \\
  KTH & 7th floor & 4 & 6474 \\
  Birmingham & Lab3 & 1 & 1342 \\
  Birmingham & Robotic Lab & 1 & 2878 \\
  Ljubljana & Office1 & 1 & 2088 \\
  Ljubljana & Office2and3 & 2 & 5687 \\
  \hline
  \end{tabular}
 \end{center}
 \caption{Description of the different datasets. Note that the KTH set with 1 room contains more frames than the one with 2 rooms, because the robot moved at a slower speed.}
 \label{tab:datasets}
\end{table}

\section{Software implementation}

The code is available as an open source project\footnote{\url{http://code.google.com/p/kth-rgbd/}}. It is released under the LGPL~v3 license. The number of dependencies has been limited to a reasonable set of open-source libraries. Unlike the RGBD-6D-SLAM project~\cite{engelhard11euron-workshop}, it is not based on the \gls{ROS} framework, and \gls{PCL} is mainly used for the reconstruction at the final step. The main dependencies are the following:
\begin{itemize}
\item g2o: for the graph optimization
\item OpenNI: to acquire the Kinect data
\item OpenCV: open source library used to visualize the intermediate results with bitmaps (frame matching) and to compute the \gls{SURF} features
\item SIFT Library by Rob Hess~\cite{hess_sift}: for the \gls{SIFT} extraction and initial matching
\item Eigen3: for the geometric support with transformations and matrices
\item Point Cloud Library~\cite{Rusu_ICRA2011_PCL}, standalone distribution (cminpack, Flann, Eigen3, OpenNI): for the transformations and the export to point cloud datafiles
\item Boost: support library, used here to access the filesystem
\end{itemize}

\clearpage
\section{User interface}
To use the program in real time, a simple graphical user interface was developed. The main window displays two frames with their features. The upper frame is the current frame, and the lower frame is the last valid frame. If the sequence is out of synchronization, meaning that no valid transformation was found between the last two frames, this layout allows the user to move back by visually comparing the current frame with the last valid one. The sequence is then resumed automatically as soon as a valid transformation is found. The quality of the transformation, corresponding to the ratio of inliers, is also displayed.

\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{pictures/screenshot}\\
\caption{User interface, displaying the current and last frames with their features}
\end{figure}

\clearpage
\section{Data output}
\label{sec:data_output}

As a result, the program exports the positions of the cameras, and the map represented by a point cloud (\gls{PCL}). For reasons of performance, it is necessary to control the memory usage by limiting the size of the final point cloud. Theoretically, each point cloud can be composed of $640\times480 = 307,200$ points. In practice, the effective number of points is lower, as the depth information is not available for every pixel (mainly due to the range limit and to occlusions, considering the difference in viewpoint between the IR and RGB sensors). A typical point cloud is composed of about 200,000 points. For each point, 24 bits are used for the position (x, y, z) and 24 bits for the colors R, G, B ($3\times8$ bits), padded to 8 bytes for memory alignment. Therefore, a colored point cloud occupies around $200k\times8$~bytes = 1.6~MB in memory. For visualization purposes, it is reasonable to reduce the amount of information. Each point cloud is first subsampled, simply by randomly removing points with a given probability (according to a fixed ratio, set as a parameter depending on the global size of the scene).
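The memory estimate and the random subsampling can be sketched as follows (illustrative Python; the 8-byte point size follows the figures above):

```python
import random

BYTES_PER_POINT = 8  # 24-bit position + 24-bit color, padded to 8 bytes

def memory_footprint(num_points):
    """Approximate in-memory size of a colored point cloud, in bytes."""
    return num_points * BYTES_PER_POINT

def subsample(points, keep_ratio, rng=random):
    """Keep each point independently with probability `keep_ratio`."""
    return [p for p in points if rng.random() < keep_ratio]
```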

Generally, only one output file is produced, but for a large map this can still lead to a big file. In that case, the total number of points can be high, and too high a subsampling rate would result in a low-quality map. By setting a threshold on the size of the final point cloud file, it can be split into several subfiles, a new file being started each time the threshold is reached during the reconstruction of the global map. This allows a low subsampling rate to be kept, while still visualizing each portion of the global scene with a good definition. For the visualization of the point clouds, the standard viewer provided in \gls{PCL} is used.
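A possible sketch of this splitting logic (hypothetical Python; the threshold is expressed here as a number of points):

```python
def split_scene(clouds, max_points):
    """Accumulate transformed clouds and start a new output file each
    time the threshold is reached, returning one group per subfile."""
    files, current, count = [], [], 0
    for cloud in clouds:
        current.append(cloud)
        count += len(cloud)
        if count >= max_points:
            files.append(current)
            current, count = [], 0
    if current:  # remaining points go into a last subfile
        files.append(current)
    return files
```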

%In terms of performance, we compared SIFT and SURF on the largest dataset.
%
%\begin{table}[H]
% \begin{center}
%  \begin{tabular}{llcc|cc}
%  \hline
%  University & Dataset & Rooms & Frames & SIFT & SURF \\
%  \hline
%  KTH & 7th floor & 4 & 6474 & 2h24mn & 0h54mn \\
%
%  \hline
%  \end{tabular}
% \end{center}
% \caption{Performance using SIFT and SURF.}
%\end{table}

%\clearpage

\cleardoublepage
\section{Map at CVAP -- one room}

This map is built from a sequence containing one room (see table~\ref{tab:datasets}). The graph is optimized with one loop closure, which is triggered after going through the room and back, close to the initial position.

\begin{figure}[H]
\centering$
\begin{array}{c}
\subfloat[from initial graph]{\label{fig:room1_1}\includegraphics[width=0.7\textwidth]{pictures/room1_initial}} \\
\subfloat[from optimized graph]{\label{fig:room1_2}\includegraphics[width=0.7\textwidth]{pictures/room1_optimized}}
\end{array}$
\caption{Map at CVAP with 1 room -- \protect\subref{fig:room1_1} Without graph optimization. \protect\subref{fig:room1_2} With loop closure. Note how the corner of the table and the chair appear to be doubled in the top figure, and how this is corrected in the bottom figure.}
\end{figure}

\begin{figure}[H]
\centering$
\begin{array}{c}
\subfloat[from initial graph]{\label{fig:room1_chair_1}\includegraphics[width=0.8\textwidth]{pictures/room1_chair_initial}} \\
\subfloat[from optimized graph]{\label{fig:room1_chair_2}\includegraphics[width=0.8\textwidth]{pictures/room1_chair_optimized}}
\end{array}$
\caption{Map with 1 room, details of the chair -- \protect\subref{fig:room1_chair_1} Without graph optimization. \protect\subref{fig:room1_chair_2} With loop closure. The reduction of the drift is clearly noticeable.}
\end{figure}

\clearpage
\section{Map at CVAP -- two rooms and corridor}

This map is built from a sequence containing two rooms separated by a corridor (see table~\ref{tab:datasets}). The graph is optimized with two loop closures: the first one with an internal loop in the living room, and the other one at the end of the sequence, after returning to the first room close to the initial position.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \includegraphics[width=0.75\textwidth]{pictures/room2_1}\\
 \includegraphics[width=0.75\textwidth]{pictures/room2_2}
 \end{array}$
\caption{Map with 2 rooms, optimized with two loop closures. The rooms are clearly displayed and correctly oriented. The main issue concerns the dividing wall in the corridor.}
\end{figure}

\clearpage
\section{Map at CVAP -- four rooms and corridor}

This map is built from a sequence containing four rooms separated by a corridor (see table~\ref{tab:datasets}), first without graph optimization. 

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \includegraphics[width=0.8\textwidth]{pictures/room4_new_init1}\\
 \includegraphics[width=0.8\textwidth]{pictures/room4_new_init2}
 \end{array}$
\caption{Map with 4 rooms and corridor, not optimized. The main issue that can be noticed at this scale is the relative orientation of the first room (first figure, bottom right), which looks misaligned, and its desk appears to be doubled.}
\end{figure}

\clearpage
For the following map, the graph is optimized with four loop closures: three triggered by internal loops in three of the rooms, and the last one at the end of the sequence, after returning to the first room close to the initial position.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \includegraphics[width=0.75\textwidth]{pictures/room4_new_optim1}\\
 \includegraphics[width=0.75\textwidth]{pictures/room4_new_optim2}
 \end{array}$
\caption{Map with 4 rooms and corridor, optimized with several loop closures. The alignment of the first room looks better (first figure, bottom right), but the living room (bottom center) shows some misaligned frames. This is a trade-off with respect to the global error.}
\end{figure}

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \subfloat[]{\label{fig:poses_4r_1} \includegraphics[width=0.5\textwidth]{figures/poses_4r_odometry_raw}} \\
 \subfloat[]{\label{fig:poses_4r_2} \includegraphics[width=0.5\textwidth]{figures/poses_4r_initial}} \\
 \subfloat[]{\label{fig:poses_4r_3} \includegraphics[width=0.5\textwidth]{figures/poses_4r_optimized}}
 \end{array}$
\caption{Estimation of the poses -- \protect\subref{fig:poses_4r_1} From standard odometry, raw data. Note the drift: the start and end points should nearly coincide. \protect\subref{fig:poses_4r_2} With this system, initial graph. \protect\subref{fig:poses_4r_3} With the optimized graph. There is a slight variation.}
\end{figure}

\clearpage
To analyze the quality of the localization, we compared the resulting path with one close to the ground truth, generated by a \gls{SLAM} system developed at \gls{CVAP}, based on \gls{EKF} using the laser scan data~\cite{Folkesson_07}.
\begin{figure}[H]
\centering
\subfloat[]{\label{fig:mappath_1}\includegraphics[width=0.75\textwidth]{figures/cvap_7th_4r_sift}} \\
\subfloat[]{\label{fig:mappath_2}\includegraphics[width=0.75\textwidth]{figures/cvap_7th_4r_surf}}
\caption{Trajectory overlaid on the reference map of the 7th floor, using \protect\subref{fig:mappath_1} SIFT features. \protect\subref{fig:mappath_2} SURF features. In both cases, the drift is noticeable, but the results are relatively close. The total distance gives an indication of the accuracy.}
\label{fig:plan_cvap}
\end{figure}

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \includegraphics[width=0.9\textwidth]{figures/drift_y_sift} \\
 \includegraphics[width=0.9\textwidth]{figures/drift_y_surf}
 \end{array}$
\caption{Analysis of the vertical drift. The Y-coordinates have been shifted so that the starting position is at the origin. Since the camera is fixed on a support, the value should stay close to zero. This issue needs to be further investigated.}
\end{figure}

\clearpage
\section{Map from other universities}

For this map, the \emph{Lab3} dataset provided by the University of Birmingham was used. No particular issues were encountered.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \includegraphics[width=0.8\textwidth]{pictures/bham_lab3_1} \\
 \includegraphics[width=0.8\textwidth]{pictures/bham_lab3_2}
 \end{array}$
\caption{Birmingham Lab3}
\end{figure}

\clearpage
For this map, the \emph{Robotic Lab} dataset provided by the University of Birmingham was used. An issue arose because people were moving while the robot was passing, leading to some inaccuracies in the transformations. The system is not able to handle such situations: the moving elements affect the feature matching, and therefore the overall quality of the map.

\begin{figure}[H]
\centering$
\begin{array}{c}
\includegraphics[width=0.8\textwidth]{pictures/bham_robotlab_1} \\
\includegraphics[width=0.8\textwidth]{pictures/bham_robotlab_2}
\end{array}$
\caption{Birmingham Robotic Lab}
\end{figure}

\clearpage
For this map, the \emph{Office1} dataset provided by the University of Ljubljana was used. It could only be processed partially: the sequence was interrupted due to a lack of features. This is one intrinsic limitation of the method presented in this work.

\begin{figure}[H]
\centering$
 \begin{array}{c}
 \subfloat[]{\label{fig:ljub1_1} \includegraphics[width=0.7\textwidth]{pictures/ljub_office1}} \\
 \subfloat[]{\label{fig:ljub1_2} \includegraphics[width=0.7\textwidth]{pictures/ljub_office1_feature}}
 \end{array}$
\caption{Ljubljana office1 -- \protect\subref{fig:ljub1_1} Map without loop closure. \protect\subref{fig:ljub1_2} Not enough features could be extracted at this point. This typically occurs when flat surfaces are predominant in the scene.}
\end{figure}

\clearpage
For this map, the \emph{Office2and3} dataset provided by the University of Ljubljana was used. As with the previous dataset, it could only be processed partially: the sequence was interrupted due to a lack of features.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{pictures/ljub_office2and3}
\caption{Ljubljana office2 -- map without loop closure}
\end{figure}

\section{Summary}
These experiments show that 3D maps can be built with this system, and illustrate the benefit of loop closures for obtaining a good correction. Though the results are satisfying for small maps, a noticeable drift can still appear as the size of the map grows; here, some misalignments can be perceived with 4 rooms. The most obvious limitations were also encountered, showing that the process can be interrupted as soon as there are not enough visual features to detect a valid transformation, and how moving elements can alter the final result.

\chapter{Conclusions and Future Work}
\label{chap:conclusion}

This work provided a good insight into the general problem of \gls{SLAM}, and more specifically of \gls{VSLAM} with feature extraction and matching. The approach can be described by the following steps:
\begin{enumerate}
\item Feature extraction: \gls{SIFT}/\gls{SURF} extraction in 2D, preselection with depth;
\item Feature matching: nearest neighbors on the feature space, using kd-trees;
\item Transformation: computed with \gls{RANSAC} on the 3D keypoints;
\item Graph optimization: based on the $g^2o$ framework, a better graph can be computed with new constraints by detecting loop closures;
\item Scene reconstruction: registration of the point clouds according to the estimated keyposes, which are concatenated to build the global map.
\end{enumerate}

Given this, it is now possible to imagine some improvements of the current work, both in terms of quality of the results and overall performance.

\section{Quality improvements}

\begin{itemize}
\item Finding a better transformation: the \gls{RANSAC} method returns a rigid transformation that is not necessarily the best. The parameters may be tuned in a more efficient way to give more accurate results, or defined with a variant like MLESAC \cite{TorrZ00}. The tuning also concerns the initial matching done with the \gls{SIFT}/\gls{SURF} features. The \gls{ICP} \cite{zhang_92_icp} method should give a refined solution. The output of the \gls{RANSAC} loop could be used as an initial guess for \gls{ICP}. 
\item Graph optimization: it could be improved by formulating more accurately the optimization problem and refining the minimization method (possibly with~$g^2o$ by deriving new classes and defining appropriate error functions), and also by exploiting the possibilities of modularity, for example with hierarchical levels corresponding to different scales for bigger maps. New criteria could be defined for loop closures, by introducing some knowledge about the past frames (features or estimated position more likely to give a loop closure with the current frame) or by matching only a portion of scene that does not necessarily give a valid transformation in the current system.
\item Additional sensors: in this work, the only sensor used was the Kinect providing RGB-D data. One possibility would be to use other sensors and combine them to get a better result in a multi-modal or hybrid system, especially when the transformations computed by the method presented in this work are less reliable.
\end{itemize}

\section{Performance improvements}

Currently, most of the time is spent on the feature extraction. Here are a few alternatives:
\begin{itemize}
\item GPU implementation: a feature extraction optimized for graphics card hardware would dramatically lower the computational time, enabling real-time applications. Various GPU implementations exist for both SIFT and SURF. In addition, some solutions take advantage of parallel computing on specific platforms, such as CUDA\texttrademark{} for NVIDIA graphics cards. The feature extraction could then be performed in 15--20~ms~\cite{bjorkman}.
\item BRIEF: combined with a keypoint detector such as \gls{SURF} or \gls{SIFT}, the \gls{BRIEF} descriptors~\cite{Calonder10-brief} could be an interesting alternative, being more efficient to compute. However, this would require rewriting the matching code to use binary string comparisons. See also the PhD thesis of Calonder~\cite{Calonder10_PhD}.
\item ICP: though the standard algorithm could increase the computational time, some efficient variants exist~\cite{Rusinkiewicz_2001}.
\end{itemize}

\section{Other approach}

With the KinectFusion project, Microsoft Research presents a solution performing reconstruction without any features, while achieving real-time performance, as demonstrated\footnote{KinectFusion: \url{http://www.youtube.com/watch?v=quGhaggn3cQ} (last access: Jan 2012)} at SIGGRAPH~2011. The camera tracking is based on a GPU implementation of \gls{ICP} on the point clouds generated from the depth data. The related papers \cite{Izadi_2011_SIGGRAPH}~\cite{Newcombe_2011_ISMAR} describe how an accurate dense reconstruction can be performed in real time, which sounds promising for many future applications related to computer vision, not only mapping but also segmentation and object recognition. A new version of the Kinect camera working at shorter range will also extend the possibilities.
